System and method for processing partially unstructured data

ABSTRACT

A system and method for processing partially unstructured data relating to a financial security. The system and method resolve first- and second-identifying data from the partially unstructured data and determine whether a security is defined by the first-identifying data and the second-identifying data. Additionally, the system and method resolve trade information relating to the security identifier from the partially unstructured data. If a security is defined by the resolved identifying data, a security identifier representing the defined security, along with the trade information relating to the defined security, are output.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. patent application Ser. No. 10/740,058, filed Dec. 18, 2003, which claims the benefit of U.S. Provisional Patent Application No. 60/511,591, filed Oct. 15, 2003. These applications are incorporated by reference herein in their entirety.

REFERENCE TO COMPUTER PROGRAM LISTING APPENDIX

This patent application includes a computer program listing appendix saved as a text file named “appendix1.txt”, which is being submitted herewith via EFS-Web. The computer program listing provided in the text file named “appendix1.txt” is an exact copy of the computer program listing provided in the text file named “appendix1.doc”, which was created on Nov. 7, 2003 and submitted with parent U.S. patent application Ser. No. 10/740,058. The text file named “appendix1.txt” is 60,820 bytes in size. This computer program listing appendix is hereby incorporated herein by reference.

FIELD OF THE INVENTION

This invention relates to a system and method for processing partially unstructured data to extract valuable information from the partially unstructured data. In particular, this invention relates to processing partially unstructured data, such as text, to extract information of interest, such as information relating to the trading of securities. This invention enables traders of securities to access a higher quantity of trade information than they would ordinarily be able to access.

BACKGROUND OF THE INVENTION

For the trader of securities, it is very important to know what the best available prices are on the street in a timely manner and to be able to use the trading opportunities that these prices present before the window of opportunity closes. Nowhere is this more important than for bond trading. Typically, a Credit Default Swap trader receives information about bond prices in the form of emails. The bulk of these emails arrive within a very short period of time around the time when the markets open, and the information contained within these emails is valuable only for a limited period of time. It is common for traders to receive hundreds of these emails in the morning. Buried within these emails are often good trading opportunities.

In the conventional arrangement, the trader had to manually read through each of these emails to find out what prevailing bond prices are being offered on the street. However, the trader often cannot read through all of these emails before the window of opportunity closes for taking advantage of the information in these emails. For every email the trader does not have time to read, he or she misses an opportunity to earn a profit.

Further, no rigid formatting convention for these types of emails exists. They are fairly unstructured and often differ significantly from one another. For example, an email may have lines talking about an impending vacation and then may have lines stating, “by the way, I want to sell this particular bond at this particular price.” Also, the email may or may not provide all of the information commonly used to identify a particular bond. Therefore, the lack of consistent formatting in emails presents a technical problem for extracting trading opportunity information from such emails with a relatively high rate of success.

SUMMARY OF THE INVENTION

These problems are addressed and a technical solution achieved in the art by this invention, which provides a system and method for processing partially unstructured data relating to financial securities. In particular, this system and method resolve first-identifying data from the partially unstructured data, resolve second-identifying data from the partially unstructured data, and determine whether a security is defined by the first-identifying data and the second-identifying data when the second-identifying data is of a predetermined type. The system and method also resolve third-identifying data from the partially unstructured data and determine whether a security is defined by the first-identifying data, the second-identifying data, and the third-identifying data. Additionally, the system and method resolve trade information relating to the security identifier from the partially unstructured data. If a security is defined by the first- and second-identifying data, or by the first-, second-, and third-identifying data, a security identifier representing the defined security is output along with the trade information relating to the security. Optionally, it is determined whether a security is unambiguously defined by the identifying data. In one embodiment, the first-identifying data represents a ticker, the second-identifying data represents a coupon or a maturity, the third-identifying data represents the other of a coupon or a maturity that the second-identifying data represents, and the predetermined type is a maturity.

Described in a different manner, the system and method identify at least one of a plurality of predefined data vectors from partially unstructured data. The partially unstructured data includes a plurality of data items having positions relative to each other in the partially unstructured data. The system and method determine a position of each of one or more data items of a first type from the plurality of data items in the partially unstructured data. A data item of a second type is selected from the plurality of data items in the partially unstructured data. The system and method also select one of the one or more data items of the first type based on its position relative to the selected data item of the second type. A data item of a third type and a data item of a fourth type are selected from the plurality of data items in the partially unstructured data. The system and method identify a predefined data vector from the plurality of predefined data vectors from the selected data item of the first type, the selected data item of the second type, and the selected data item of the third type. The data item of the fourth type and an identifier representing the identified data vector are output. Examples of data items of a first, second, third, and fourth type are a ticker, coupon, maturity, and trade information, respectively. Alternate examples of data items of a first, second, third, and fourth type are a ticker, maturity, coupon, and trade information, respectively. An example of an identifier is a CUSIP. The data items of the first, second, third, and fourth type, along with the identifier, may be stored in a context.

This invention provides a technical solution in that it processes the vast quantity of emails that a trader receives in the morning and extracts from many of them the identities of the securities, such as stocks and/or bonds, and trade information relating to each of the identified securities, such as bid and/or offer prices. The extracted information is then accessible to the trader in the morning when the markets open, the time period when it is needed. The invention provides much more information regarding prevailing bond prices than would normally be available if the trader had to manually read through each of the emails.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of this invention may be obtained from a consideration of this specification taken in conjunction with the drawings, in which:

FIG. 1 is an example of a hardware arrangement implementing the preferred embodiment; and

FIGS. 2-5 are flowcharts depicting the major processing steps performed by the preferred embodiment.

DETAILED DESCRIPTION

I. Definitions

Prior to discussing the details of the preferred embodiment, several definitions of terms used throughout this specification are set forth below.

1) Ticker: a system of letters used to uniquely identify a stock or mutual fund.

2) Coupon: the interest rate stated on a bond when it's issued. Also referred to as “Rate.”

3) Maturity: The length of time until the principal amount of a bond must be repaid.

4) Bid: Price at which to buy a security. A bid is considered a type of trade information relating to the security.

5) Offer: Price at which to sell a security. Also referred to as “Ask.” An offer is considered a type of trade information relating to the security.

6) Token: a segment of data that can represent one of (1) a security's coupon, (2) maturity, or (3) bid and/or offer.

7) CUSIP Number or CUSIP: A number used to identify all U.S. and Canadian stocks and registered bonds. (“CUSIP” is a registered trademark of the American Bankers Association.) A security's CUSIP can be identified by its ticker, coupon, and/or maturity. Therefore, a ticker, coupon, and maturity are types of identifying information used to identify a CUSIP for a particular security. One having ordinary skill in the art will appreciate that a CUSIP could be represented as a data vector comprising the particular ticker, coupon, and maturity associated with the CUSIP in question as data items.

8) CINS Number or CINS: A number used to identify all international stocks and registered bonds. A security's CINS can be identified by its ticker, coupon, and/or maturity.

9) Ticker Domain: a region in an email that is associated with a particular ticker identified in the email, wherein if a token is located in this region, it is associated with the particular ticker.

10) Context: A set of information relating to a particular security, the information including identifying data, such as the security's ticker, coupon, maturity, and bid and/or offer prices, wherein the identifying data can be used, among other things, to identify one or more CUSIP numbers that correspond to the identifying data.

II. Description

The preferred embodiment of this invention is described in the context of processing emails containing information relating to bonds, wherein the bonds are identified by their CUSIP number. However, one having ordinary skill in the relevant art will appreciate that the disclosed system and method can be readily adapted to process data transmitted in different manners besides email. For instance, the data can be in the form of a regular text file, an image file that has been converted to a text file, or any type of file that can be parsed by a computer to extract text information. One having ordinary skill in the relevant art will also appreciate that the disclosed system and method can be readily adapted to process data besides bonds, including other security types, such as stocks and/or mutual funds. Additionally, the disclosed system and method can be readily adapted to search for other types of identifiers besides CUSIP numbers, such as CINS numbers, or any other means to identify data, without departing from the scope of this invention.

Prior to discussing the details of the preferred embodiment, an example of a portion of an email received by a trader will be explained. Consider the following example, shown in Table I below, of an excerpt from an email received by a trader.

TABLE I
IBM 07/05 8 100/
    04/07 10 /230

In Table I, the first line refers to a single security. The letters “IBM” refer to the ticker relating to the security. “07/05” refers to the maturity month and year of the bond, “8” refers to the coupon, or rate, of the security, and “100/” is the bid price because it is followed by a “/”. The second line refers to another security with the same ticker. The “04/07” refers to the security's maturity month and year, the “10” refers to the coupon, and the “/230” refers to the offer price because it is preceded by a “/”.
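By way of illustration only, the slash convention described above might be sketched as follows. The function name `parse_quote` is hypothetical and does not come from the program listing appendix, and the sketch assumes the token has already been identified as a quote rather than as a maturity such as “07/05”, which also contains a “/”.

```python
from typing import Optional, Tuple


def parse_quote(token: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a quote token into (bid, offer) using the position of the '/'.

    Hypothetical helper: a number before the '/' is read as the bid and
    a number after the '/' is read as the offer.
    """
    bid_text, separator, offer_text = token.partition("/")
    if not separator:
        raise ValueError(f"no '/' in quote token: {token!r}")
    bid = float(bid_text) if bid_text else None
    offer = float(offer_text) if offer_text else None
    return bid, offer
```

Under these assumptions, “100/” yields a bid with no offer, and “/230” yields an offer with no bid.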

Needless to say, most emails are not this structured, but contain the same or similar types of information. The details of how the preferred embodiment of the invention processes all of these emails, whether structured or not, will now be set forth, beginning with reference to FIG. 1.

FIG. 1 depicts a preferred hardware arrangement implementing the present invention. In FIG. 1, a server computer 101, either containing a database 102, or being in communication with a database 102, is in communication via communication mechanism 104 with one or more workstation computers 103. Any method of communicating between computers may be used between the server 101 and the workstations 103, and the server 101 and the database 102, if not contained within the server 101. The communication mechanism 104 need not be a hardwired network, and may be wireless, or a combination of both. Workstations 103 do not have to be actual desktop computers, as shown in FIG. 1, and can be other types of computers, such as laptops, hand-held devices, or any device that includes a computer.

In the preferred embodiment, the database 102 stores all of the emails received from clients, typically via the Internet, and it also stores a list of all bonds, including their ticker, coupon, maturity, and CUSIP number. Traders have access to workstations 103, where they can log in to access their particular account. Logging in includes communication with the server 101 to transmit that particular trader's information to the trader's workstation 103. Also according to the preferred embodiment, the present invention is implemented as a program stored on the server 101, where it is executed to process the received emails and extract the bond CUSIP numbers and their bid and/or offer prices. However, the program can be stored on one or all of the workstations 103, and executed from any location. Also, it is possible to have the program and database all stored on a single computer.

The manner of processing the emails according to the preferred embodiment of this invention will now be described with reference to FIGS. 2-5. FIG. 2 provides a high level view of the entire process performed by this embodiment. The subsequent figures, FIGS. 3-5, provide more detail regarding 207 shown in FIG. 2. With reference to 201 in FIG. 2, the list of bonds, containing each bond's ticker, coupon, maturity, and CUSIP number, is initially downloaded from the database 102 into the local memory of the computer performing the email processing, such as the server 101. Next, it is determined at 202 whether or not any unprocessed emails exist in the database 102. If none exist, it is determined that all of the emails have been processed at 203 and any bond CUSIPs and their corresponding bid and/or ask prices that have been identified through the email processing are stored in the database 102 and output via email to the traders at 204.

If unprocessed emails remain in the database 102, the next of those emails is downloaded from the database 102 and stored in local memory for processing at 205. Some initial preprocessing of the downloaded email is performed at this time to eliminate the header of the email and, optionally, to store general statistical information about the email, such as the number of occurrences of the words “bid” and “offer” that are present in the email. The statistical information may be useful in identifying bid and/or ask prices included in the email.
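The preprocessing at 205 might look something like the following minimal sketch, which uses the Python standard library `email` package to separate the header from the body. The function name `preprocess_email` and the returned dictionary keys are assumptions made for illustration, and the sketch assumes a single-part, plain-text message.

```python
import email
import re
from typing import Dict


def preprocess_email(raw: str) -> Dict[str, object]:
    """Drop the header and count occurrences of the words 'bid' and 'offer'.

    Hypothetical helper; assumes a single-part plain-text message, for
    which get_payload() returns the body as a string.
    """
    message = email.message_from_string(raw)
    body = message.get_payload()
    words = re.findall(r"[a-z]+", body.lower())
    return {
        "body": body,
        "bid_count": words.count("bid"),
        "offer_count": words.count("offer"),
    }
```

The counts are kept alongside the body so that later steps can consult them when deciding whether a number is likely to be a bid or an offer.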

Next, a map of all of the tickers in the current email is generated at 206. The map stores each ticker name found in the email as well as its position in the email. This map will subsequently be used to determine to which ticker a particular token belongs.

To identify a ticker, the preferred embodiment processes the email line-by-line. Before looking for tickers in a line, the line is preprocessed to correct formatting issues, such as making instances of “2×2” and “2×2” uniform. After the line has been preprocessed, the line is parsed one word at a time, comparing each word to the list of tickers provided by the downloaded bond list data 201. If the word matches a ticker, several checks are executed to determine whether it is in fact not a ticker, even though it matches one in the list. In particular, if the word is preceded or succeeded by a “/” or a “,” followed by another word, such as “FON/AWE” or “FON,AWE”, then the word is determined not to be a ticker, and it is skipped for the next word. Also, if it is a word like “cash bonds”, “AT +344”, “AT /+344”, or “AT $544”, it is determined that the word is not a ticker, and it is ignored.

If the word that matches a ticker in the list is not eliminated by the above-described checks, the word is determined to be a ticker and its position in the line of the email is recorded. If the ticker is located at the start of a set of data, then the position of the ticker is not adjusted. An example of a ticker located at the start of a set of data is shown in Table II below, wherein “FON” is the ticker.

TABLE II
FON 6.25 11 360-370

If the ticker is located to the right of a start of a set of data, as shown for example in Table III below wherein “BA” is the ticker, then the position of the ticker is chosen to be the first word in the line that matches a word in the issuer's name for that ticker, i.e., “BOEING”. Issuer names can be provided with the downloaded bond data 201.

TABLE III
BOEING CAPITAL C(BA) 5.65 05/06 65-60

After the map of tickers is built, it is preferable to adjust the map such that if the left-most ticker in a line is not at position zero, then the positions of the tickers in that line are shifted to the left so that the left-most ticker in the line is at position zero. This simplifies subsequent processing.
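The ticker map and the left shift described above can be sketched roughly as follows. This is an illustrative simplification, not the code of the appendix: the name `build_ticker_map` is hypothetical, only the “/” and “,” checks are modeled, and the issuer-name adjustment shown for Table III is omitted.

```python
import re
from typing import Dict, List, Set, Tuple


def build_ticker_map(
    lines: List[str], known_tickers: Set[str]
) -> Dict[int, List[Tuple[str, int]]]:
    """Map line number -> [(ticker, position)] with the left-most ticker at 0."""
    ticker_map: Dict[int, List[Tuple[str, int]]] = {}
    for line_no, line in enumerate(lines):
        found: List[Tuple[str, int]] = []
        for match in re.finditer(r"[A-Za-z]+", line):
            if match.group() not in known_tickers:
                continue
            # Words joined by "/" or "," (e.g. "FON/AWE") are not tickers.
            joined_before = re.search(r"[A-Za-z][/,]$", line[: match.start()])
            joined_after = re.match(r"[/,][A-Za-z]", line[match.end():])
            if joined_before or joined_after:
                continue
            found.append((match.group(), match.start()))
        if found:
            # Shift left so that the left-most ticker sits at position zero.
            leftmost = found[0][1]
            ticker_map[line_no] = [(t, pos - leftmost) for t, pos in found]
    return ticker_map
```

Lines without any surviving ticker are simply left out of the map in this sketch.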

At this point, a map of all tickers in the current email is generated, completing the processing described at 206 in FIG. 2. After this, the email is then processed line-by-line at 207 in an attempt to extract bond CUSIP numbers and the corresponding bid/offer prices. After all of the lines of the current email have been processed, it is determined that the current email has been completely processed at 208, and the process repeats by checking the database 102 for a next unprocessed email at 202.

Now the manner in which the email is processed line-by-line in an attempt to extract bond CUSIP numbers and the corresponding bid and/or offer prices at 207 will be described with reference to FIG. 3. At 301, it is determined whether a next, unprocessed line in the email exists. If not, execution proceeds to 208 in FIG. 2, where it is decided that the current email has been completely processed. If an unprocessed line does exist, it is marked for processing at 302. In other words, the unprocessed line is identified by a pointer, a corresponding array position, or loaded into a local variable, etc. The marked email line becomes the “current” email line for processing.

Next, it is determined whether the current email line is a data line at 303. “Data line” means that the current email line has data that could represent a coupon, maturity, or a bid and/or offer. To determine if the current email line is a data line, this embodiment of this invention checks for data having the format of coupons, maturities, bids, or offers. For instance, the current email line must contain numbers to be a data line. Otherwise, no coupon, maturity, bid, or offer is assumed to be present. Also, numbers having a “/” between them could be a bid and an offer. If numbers having the format of a coupon, maturity, bid, or offer are found, the current line is determined to be a data line and processing of the line continues at 304. Otherwise, it is determined not to be a data line, and the current line is skipped. Execution then proceeds to 301 to check for a next unprocessed email line.

At 304, a list of tokens in the line is prepared. For each token found, its position in the line is recorded. Preparing the list of tokens is achieved by searching the line for numbers and numbers separated by a “/” or a “-”. Numbers separated by a “/” or a “-” are considered a single token. Table IV below shows examples of tokens, wherein each row in the table represents a single token.

TABLE IV
05/06
5.65
10
100/
90-100
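One possible way to prepare the token list at 304 is with a regular expression, as in the hedged sketch below. The pattern and the name `tokenize` are illustrative assumptions; the pattern treats numbers joined by “/” or “-” as one token and also accepts a leading or trailing “/”, but it would also pick up digits embedded in ordinary words, which a fuller implementation would need to filter.

```python
import re
from typing import List, Tuple

# A number, optionally followed by more numbers joined with "/" or "-",
# with an optional leading or trailing "/" (as in "/230" or "100/").
_TOKEN = re.compile(r"/?\d+(?:\.\d+)?(?:[/-]\d+(?:\.\d+)?)*/?")


def tokenize(line: str) -> List[Tuple[str, int]]:
    """Return (token text, position in line) for each numeric token."""
    return [(m.group(), m.start()) for m in _TOKEN.finditer(line)]


def is_data_line(line: str) -> bool:
    """A line with no numeric token cannot hold a coupon, maturity, bid, or offer."""
    return bool(tokenize(line))
```

Recording the position with each token is what later allows a token to be compared against ticker positions in the same line.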

After a list of tokens has been prepared for the current line at 304, the tokens in the line are processed to determine if they are coupons, maturities, or bids and/or offers at 305. If tokens are identified as maturities, or if all of the tokens in a line have been processed, an attempt is made to identify one or more CUSIPs for the ticker and related token(s) that have been identified. This process is discussed in more detail below, with reference to FIGS. 4 and 5. However, before discussing this process, it is helpful to first define the usage of the terms “context” and “ticker domain”, which will be used throughout the remainder of this description.

A “context” is a set of stored information relating to a particular bond. This set of information includes identifying data, including the bond's ticker, coupon, and maturity, which are used to attempt to resolve a CUSIP for the particular bond. An example of two contexts is shown in Table V below.

TABLE V
Context 1:
  Ticker: BA
  Coupon: 5.65
  Maturity: 05/06
  Bid Price: 100
  Offer Price: 95
Context 2:
  Ticker: BNI
  Coupon: Null
  Maturity: 12/05
  Bid Price: 104
  Offer Price: Null

Although Table V shows five data fields for ticker, coupon, maturity, bid price, and offer price, the context may include more than these data fields. When a token relating to a particular context is identified as a coupon, maturity, or bid price and/or offer price, the token's data is then stored in the corresponding field of the context. For example, if a current token pertaining to the ticker BNI has the data 6.375, and such data has been identified as a coupon, the null value for the coupon field in context 2 will be replaced with 6.375.

Each context relates to one of the tickers located in the email, as mapped at 206 in FIG. 2. It is possible, however, to have more than one context relating to a single ticker in the situation where several sets of coupons and maturities are described with reference to a single ticker. When the context for a particular ticker is initialized, the data fields for coupon, maturity, bid price, and ask price are set to NULL. As data for these fields are extracted from the tokens in the email, their NULL values are replaced with the newly extracted data.
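A context of the kind described above could be represented, for example, by a small record type. The class below is only a sketch: the name `Context`, its field names, and the use of `None` for the NULL value are assumptions, and the `cusips` field is an optional extra for holding any identifiers that are later resolved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Context:
    """Information gathered for one bond; unset fields stay None (NULL)."""

    ticker: str
    coupon: Optional[float] = None
    maturity: Optional[str] = None
    bid: Optional[float] = None
    offer: Optional[float] = None
    cusips: List[str] = field(default_factory=list)


# Context 2 of Table V: the coupon is still NULL until a coupon token is resolved.
context_2 = Context(ticker="BNI", maturity="12/05", bid=104.0)
context_2.coupon = 6.375
```

Initializing a new context then amounts to constructing the record with only the ticker set.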

A “ticker domain” is a mechanism by which a token is associated with a particular ticker, and consequently, a particular context. This allows the data from the token to be placed in the appropriate context. For example, if an email contains the lines shown in Table VI below,

TABLE VI
BNI 6.375 12/05 100-95
CSX 7.25 05/04 105-100

the tokens 6.375, 12/05, and 100-95 are all in the BNI ticker domain, and their data will be stored in the context for the BNI ticker. The tokens 7.25, 05/04, and 105-100 are all in the CSX ticker domain, and will be stored in the corresponding context. The manner in which a token is associated with a ticker domain will be described later.

With a context and ticker domain defined, the processing of the tokens in a line will now be described with reference to FIG. 4, which is an exploded view of 305 in FIG. 3. The first action performed in processing the tokens in the current email line is to determine whether any unprocessed tokens exist in the current line at 401. If all of the tokens in this line have been processed, an attempt is made to resolve a CUSIP for the current context at 402. The current context is the context for the ticker to which the previous token applied. In other words, if the previous token was a coupon for ticker “BA”, the current context is the context for ticker “BA”. It is noted that because the current line has been determined to be a data line, 303 in FIG. 3, at least one token exists in this data line, thereby preventing the scenario where the line has no token.

After attempting to resolve the CUSIP for the current context, the manner of which will be explained in more detail later when discussing FIG. 5, the current line's processing is complete, and the process returns to 301 in FIG. 3, where the email is checked for another unprocessed line. If more unprocessed tokens exist in the current line, the next unprocessed token is selected as the current token and a check is made to determine if the current token is in a new ticker domain at 403. This is performed by searching for a ticker between the position of the previous token and the current token. If a ticker is found between the previous token's position and the current token's position, it is determined that the current token is in a new ticker domain. If no new ticker is found, it is determined that the current token is in the previous ticker domain and the token data, if resolved, is added to the context pertaining to that ticker. If no new ticker is found, the process proceeds to 501 in FIG. 5.
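The ticker domain check at 403 can be illustrated with the following sketch, which looks for a ticker positioned between the previous token and the current token. The function name `ticker_between` and the convention of passing -1 as the previous position for the first token in a line are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def ticker_between(
    tickers: List[Tuple[str, int]], prev_pos: int, cur_pos: int
) -> Optional[str]:
    """Return the ticker lying between two token positions, if any.

    `tickers` holds (ticker, position) pairs for the current line. When
    several tickers fall in the gap, the one nearest the current token wins.
    """
    candidates = [(pos, name) for name, pos in tickers if prev_pos < pos < cur_pos]
    return max(candidates)[1] if candidates else None


# Positions for the line "BNI 6.375 12/05 100-95  CSX 7.25 05/04 105-100".
line_tickers = [("BNI", 0), ("CSX", 24)]
```

A result of `None` corresponds to the case in which the current token stays in the previous ticker domain.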

If a new ticker has been found at 404, it is determined that the current token refers to the new ticker, i.e., that it is in the new ticker's domain and a context for the new ticker should be initialized. Therefore, the previous context referring to the previous ticker will be processed. This begins with an initial check for whether a previous context exists at 405, i.e., whether this newly found ticker is the first ticker in the email. If a previous context does not exist, i.e., this is the first ticker, then a first context is initialized for the new ticker at 407. For example, if “BA” is the first ticker in the email, a first context will be initialized as shown in Table VII below.

TABLE VII
Context 1:
  Ticker: BA
  Coupon: Null
  Maturity: Null
  Bid Price: Null
  Ask Price: Null

If a previous context does exist, i.e., there have been previous tickers, then an attempt is made to resolve a CUSIP for the context corresponding to the previous ticker at 406, the process of which will be described later. After the attempt to resolve the CUSIP for the previous ticker has been made, a context is initialized for the new ticker at 407.

Whether or not a previous context existed at 405, the process ultimately moves to 501 in FIG. 5, wherein an attempt is made to resolve the current token.

The first step is to determine whether the current token is a coupon at 501. This step is performed by analyzing the current token with respect to the current context. (Note that, although the following analysis is described in an order, such order is not necessarily required.) First, a check is made to find out if a coupon already exists in the current context, i.e., the coupon data field in the current context is not equal to null. If so, it is determined that the current token is not a coupon, and the processing moves on to 505 in FIG. 5. If a coupon does not exist in the current context, then the token is formatted to be in decimal form, if it is a fraction. This formatting simplifies subsequent data processing. Other formatting of the token may be performed to ensure that the token has the proper format of a coupon. Then, the formatted token is compared to the coupons in the list of bond data 201 to make certain that the formatted token has a value less than or equal to that of the maximum coupon value in the list of bond data. If the formatted token is greater than the maximum coupon value, it is determined that the current token is not a coupon, and processing proceeds to 505.

If (1) the formatted token is less than or equal to the maximum coupon value in the list of bond data, (2) a maturity exists in the current context, and (3) the current token cannot be a bid or an offer (discussed below), then it is determined that the current token is in fact a coupon. As such, the token is deemed to be resolved, it is stored in the coupon field of the current context at 502, and processing proceeds to the next token at 401 in FIG. 4. Otherwise, several more analyses are performed on the token before concluding that it is or is not a coupon.

If the current token, in its preformatted form, i.e., its original form, is a number with a fraction, such as “11¼”, then the current token is determined to be a coupon, and is stored as such in the current context at 502 and processing proceeds to the next token at 401 in FIG. 4. If the current token is preceded by a single quote, a “/”, a “-”, or a “0”, it is determined not to be a coupon, and processing proceeds to 505.
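The first coupon checks at 501 (an existing coupon in the context, conversion of a fraction to decimal form, and comparison against the maximum coupon in the bond list) might be sketched as shown below. The names `coupon_value` and `may_be_coupon` are hypothetical, only single-character fractions such as “¼” are handled, and the checks on preceding characters and on the next token are left out.

```python
import unicodedata
from typing import Optional


def coupon_value(token: str) -> Optional[float]:
    """Convert a token such as '6.375' or '11¼' to decimal form, else None."""
    whole, fraction = token, 0.0
    if token and unicodedata.name(token[-1], "").startswith("VULGAR FRACTION"):
        whole, fraction = token[:-1], unicodedata.numeric(token[-1])
    try:
        return (float(whole) if whole else 0.0) + fraction
    except ValueError:
        return None  # e.g. "05/06" or "100/" are not plain numbers


def may_be_coupon(
    token: str, context_coupon: Optional[float], max_coupon: float
) -> bool:
    """Apply the early eliminations: coupon already set, or value too large."""
    if context_coupon is not None:
        return False
    value = coupon_value(token)
    return value is not None and value <= max_coupon
```

A `True` result here only means the token remains a coupon candidate; the further analyses described in the text would still apply.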

If it is still undetermined whether or not the current token is a coupon, the preferred embodiment of this invention then looks at the next token in the line to determine if it is a maturity at 503 with the assumption that the current token is a coupon. In other words, the next token is used to provide more information about the current token. If there is no next token, it cannot be a maturity and the current token is determined not to be a coupon, and processing continues at 505. If there is a next token, it is determined whether the next token is in another ticker's domain, and consequently whether it would apply to a new context instead of the current context. If the next token is in another ticker's domain, the current token is determined not to be a coupon, and processing continues at 505. Also, if a maturity already exists in the current context, then it is determined that the next token is not a maturity and the current token is not a coupon. In this case, processing also continues at 505. Further, if the next token is a number and a fraction, it is determined that it is not a maturity and that the current token is not a coupon. Processing then proceeds to 505.

If after all of this analysis, the next token has not been resolved as a maturity, the next token is checked for compliance with a date format. If the next token is of the format MM/YY or YY or MM/YYYY, where Ms are numbers defining a month and Ys are numbers defining a year, or if the next token is a two digit integer preceded by a single quote, such as “'04”, then the next token is determined to have a date format. The next token may be preprocessed to remove day fields. For instance, a maturity of “12/5/04” can be preprocessed to be in the form 12/04. Although these particular formats are the preferred formats for a maturity date, one having ordinary skill in the relevant art will appreciate that the key point here is determining whether the next token has a date format. If the next token does not have a date format, it is determined not to be a maturity, and processing continues to 505, the current token still being unresolved.

If the next token does have a date format, the next token is parsed to look for data that could not relate to a date, such as the number thirteen in a position where a month would be located, or a dollar sign. If it has any of these characteristics, the next token is determined not to be a maturity, and the current token not a coupon. Processing then proceeds to 505. If the next token does not have any characteristic that would eliminate it from being a maturity, it is resolved as a maturity, and consequently, the current token is resolved as a coupon. Both the next token, resolved as a maturity, and the current token, resolved as a coupon, are stored in their respective fields in the current context at 504. In this case, processing proceeds to 507 for an attempt to resolve a CUSIP for the current context.
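The date-format test for a candidate maturity can be sketched as a single normalizing function. The name `normalize_maturity` is hypothetical, and reducing a four-digit year to its last two digits is an assumption of this sketch rather than something the description requires.

```python
from typing import Optional


def normalize_maturity(token: str) -> Optional[str]:
    """Return 'MM/YY' or 'YY' if the token has a date format, else None."""
    parts = token.lstrip("'").split("/")
    if len(parts) == 3:
        # MM/DD/YY: drop the day field, e.g. "12/5/04" becomes "12/04".
        parts = [parts[0], parts[2]]
    if len(parts) > 2 or not all(p.isascii() and p.isdigit() for p in parts):
        return None  # rejects "$100", "5.65", "100/", and "90-100"
    year = parts[-1]
    if len(year) not in (2, 4):
        return None
    if len(parts) == 1:
        return year[-2:]
    month = int(parts[0])
    if len(parts[0]) > 2 or not 1 <= month <= 12:
        return None  # e.g. thirteen in the month position
    return f"{month:02d}/{year[-2:]}"
```

Returning a normalized string rather than a boolean lets the same value be stored directly in the maturity field of the context.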

Anytime a token has been resolved as a maturity, as just described, an attempt is made to resolve a CUSIP for the current context. The attempt to resolve a CUSIP is performed by comparing the data in the context at issue with the data in the bond list downloaded at 201 from the database 102. The ticker, coupon, and maturity data fields in the context at issue are compared with the CUSIPs in the bond list having the same ticker, coupon, and maturity. If the coupon field in the context at issue has a null value, all CUSIPs having the same ticker and maturity as the context are identified. If the maturity field in the context at issue has a null value, all CUSIPs having the same ticker and coupon as the context are identified. (This scenario could occur if no token in a line resolved as a maturity, at 402 in FIG. 4.) All identified CUSIPs are stored for later output, and may be stored in a data field in the context itself.
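The matching just described, in which a null coupon or null maturity acts as a wildcard, can be sketched as a simple filter over the bond list. The `Bond` record, the function name `resolve_cusips`, and the identifier strings in the sample data are all made up for illustration and are not real CUSIPs. The sketch compares coupons with exact floating-point equality, which assumes both sides were parsed from the same decimal text.

```python
from typing import List, NamedTuple, Optional


class Bond(NamedTuple):
    ticker: str
    coupon: float
    maturity: str  # "MM/YY"
    cusip: str


def resolve_cusips(
    bonds: List[Bond],
    ticker: str,
    coupon: Optional[float],
    maturity: Optional[str],
) -> List[str]:
    """Return every CUSIP matching the context; None fields match anything."""
    return [
        bond.cusip
        for bond in bonds
        if bond.ticker == ticker
        and (coupon is None or bond.coupon == coupon)
        and (maturity is None or bond.maturity == maturity)
    ]


# Fictional bond list entries, for illustration only.
BOND_LIST = [
    Bond("BNI", 6.375, "12/05", "ID-0001"),
    Bond("BNI", 7.0, "12/05", "ID-0002"),
    Bond("BA", 5.65, "05/06", "ID-0003"),
]
```

An empty result corresponds to the case in which no CUSIP can be identified and processing simply continues.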

In the case where no CUSIP can be identified, processing proceeds normally, without any identified CUSIPs having been stored for later output. In the particular situation where an attempt to resolve or identify a CUSIP has been made after 507 in FIG. 5, processing continues on to the next token at 401 in FIG. 4.

Turning now to 505 in FIG. 5, if it was determined that the current token was not a coupon, the current token is then analyzed to determine if it is a maturity. If the current token is determined not to be a maturity, processing continues to 508. The manner in which the current token is determined to be or not be a maturity will now be described.

If a maturity exists in the current context, a decision is made that the current token cannot be a maturity. If the token is a number with a fraction, then it is determined not to be a maturity because date fields are not of this format. Also, the token must be able to resolve into a date format to be a maturity, and if it cannot, it is decided that it is not a maturity. As discussed with reference to 503, the preferred date formats are MM/YY or YY or MM/YYYY, with day fields having been preprocessed out of the token. If the current token does not have a date format, it is determined not to be a maturity, and processing continues to 508.

If the current token does have a date format, it is parsed to find data that could not relate to a date, such as the number 13 in a position where a month would be located, or a dollar sign. If it has any characteristic that would prevent it from being a date, the current token is determined not to be a maturity. If the current token does not have any characteristic that would eliminate it from being a maturity, it is determined to be a maturity. In this case, the token is stored as a maturity in the current context at 506. Also, since a token has been resolved as a maturity, an attempt is made to resolve a CUSIP for the current context at 507. After the attempt, the current token having been resolved as a maturity, processing of the next token begins at 401 in FIG. 4.

If the current token is not a coupon (501) or a maturity (505), it is determined whether it is a bid and/or offer at 508. A token that is to be a bid and/or an offer must have one of the following preferred formats: “N”, “N/”, “/N”, or “N/N”, where N represents a number. Whitespace can be before or after each N or “/”, and each “/” can be replaced with a “-”. Also, any tokens of this form that begin with a leading zero are determined not to be bids and/or offers because maturities usually begin with a zero. Examples of tokens that can be bids and/or offers are shown below in Table VIII.

TABLE VIII
100/
−90
60/65

In Table VIII, the “100 /” is a bid, the “−90” is an offer, and the “60/65” is an example of a token that includes both a bid and an offer, where the “60” is a bid and the “65” is an offer. Therefore, it is decided that the current token includes a bid if it is a number followed by a “/” or a “-”, excluding whitespace. Also, if it is a number greater than or equal to 50 and is followed by the word “bid”, it is determined to include a bid. Alternatively, it is determined that the current token includes an offer if it is a number preceded by a “/” or a “-”, or if it is a number that is greater than or equal to fifty and is followed by the word “offer”. A further optional way to help determine if the token includes a bid or an offer is to compare the total number of instances of the word “bid” or “offer” present in the email with the number that have been processed.
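The bid/offer rules above can be illustrated with the following Python sketch. The function name `parse_bid_offer` and its `following_word` parameter are hypothetical, and the leading-zero rule is interpreted here as applying to the first number in the token; the optional comparison of “bid” and “offer” word counts is not modeled.

```python
import re
from typing import Optional, Tuple

NUMBER = r"\d+(?:\.\d+)?"
# "N/", "/N" or "N/N", with optional whitespace and "-" allowed in place of "/".
SEPARATED = re.compile(rf"({NUMBER})?\s*([/-])\s*({NUMBER})?")


def parse_bid_offer(token: str, following_word: Optional[str] = None
                    ) -> Tuple[Optional[float], Optional[float]]:
    """Return (bid, offer); either value is None if it was not found."""
    # Treat the typographic minus sign like an ordinary hyphen.
    text = token.strip().replace("\u2212", "-")
    match = SEPARATED.fullmatch(text)
    if match:
        bid, _, offer = match.groups()
        first = bid if bid is not None else offer
        # Reject a bare separator, and tokens whose first number has a leading zero.
        if first is None or first.startswith("0"):
            return (None, None)
        return (float(bid) if bid else None, float(offer) if offer else None)
    # Bare number "N": needs a value of at least 50 and a following keyword.
    if re.fullmatch(NUMBER, text) and not text.startswith("0") and float(text) >= 50:
        word = (following_word or "").lower()
        if word == "bid":
            return (float(text), None)
        if word == "offer":
            return (None, float(text))
    return (None, None)
```

Under this interpretation, “60/65” yields a bid of 60 and an offer of 65, “100 /” yields only a bid, “−90” yields only an offer, and “04/04” is rejected because it begins with a zero.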

If it is determined that the current token includes a bid and/or an offer, the bid and/or offer data in the token is stored in the corresponding field(s) of the current context at 509. After storage, processing continues to the next token in the current email line at 401 in FIG. 4.

If it is determined that the current token does not include a bid or an offer, the current token remains unresolved, and processing also continues to the next token in the current email line at 401 in FIG. 4.

Processing of the subsequent tokens in the line is the same as the process just described. After all of the tokens in the current line are processed, each subsequent email line is processed (207 in FIG. 2), and when the email is completely processed (203 in FIG. 2), the stored security identifiers (CUSIPs) and their corresponding trade information, including bid price and/or offer price, are output at 204 in FIG. 2.

III. EXAMPLE

The processing depicted in FIGS. 3-5 will now be described with respect to an example. Suppose the line of an email shown in Table IX below is loaded for processing at 302 in FIG. 3.

TABLE IX
BAT 5.5 04/04 65-80 HHH 6.5 06 80-90

At 303, it is determined that the line shown in Table IX is a data line because it contains at least the number 5.5, which could be a coupon, and the process then proceeds to 304 to prepare a list of tokens in this line. A token is considered to be a number or numbers separated by a “/” or a “-”, and accordingly, the following tokens will be extracted from the line shown in Table IX: “5.5”, “04/04”, “65-80”, “6.5”, “06”, and “80-90”. The positions of each of these tokens in the line will also be recorded as “4”, “8”, “14”, “24”, “28”, and “31”, respectively, if the initial position in the line is considered to be zero.
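The token extraction for this example line can be reproduced with a short Python sketch. The regular expression and the name `extract_tokens` are assumptions made for illustration; the pattern covers only numbers optionally joined by “/” or “-”, as in the example, and does not attempt forms with leading or trailing separators or embedded whitespace.

```python
import re
from typing import List, Tuple

# A number, optionally followed by further numbers joined by "/" or "-".
TOKEN = re.compile(r"\d+(?:\.\d+)?(?:[/-]\d+(?:\.\d+)?)*")


def extract_tokens(line: str) -> List[Tuple[str, int]]:
    """Return (token, zero-based start position) pairs in line order."""
    return [(m.group(), m.start()) for m in TOKEN.finditer(line)]


LINE = "BAT 5.5 04/04 65-80 HHH 6.5 06 80-90"
tokens = extract_tokens(LINE)
```

For the line of Table IX, `tokens` holds the six tokens “5.5”, “04/04”, “65-80”, “6.5”, “06”, and “80-90” at positions 4, 8, 14, 24, 28, and 31, matching the positions recited above.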

At 305 in FIG. 3, which is elaborated upon in FIGS. 4 and 5, each of these tokens is processed as follows. At 401 in FIG. 4, it is determined that there are more tokens to process in this line because the six unprocessed tokens “5.5”, “04/04”, “65-80”, “6.5”, “06”, and “80-90” remain. At 403, the first token, “5.5”, is selected. Because this is the initial token, and in the case of this example, it is assumed to be the initial token in the email, the initial ticker “BAT” is identified as a new ticker at 404. Because “BAT” is the initial ticker and “5.5” is the initial token, no previous context is determined to exist at 405, and a context is initialized for ticker “BAT” at 407. This context is initialized as shown in Table X below.

TABLE X
Context 1:
Ticker: BAT
Coupon: Null
Maturity: Null
Bid Price: Null
Ask Price: Null

At 501, the process of attempting to determine if the current token “5.5” is a coupon begins. First, the current context, context 1 shown in Table X, is checked to see if a coupon already exists in the context. Because the coupon field in context 1 has a value of “Null”, no coupon is determined to exist for this context and processing continues.

Next, it is determined whether (1) the current token is less than or equal to the maximum coupon value in the list of bond data, (2) a maturity exists in the current context, and (3) the current token cannot be a bid or an offer. If all three of these determinations are true, the current token is determined to be a coupon. However, since a maturity does not exist in context 1, this check fails and processing continues.

Next, it is determined whether the current token “5.5” is a number followed by a fraction or if it is preceded by a single quote, a “/”, a “-”, or a zero. If it is a number followed by a fraction or if it is preceded by a single quote, a “/”, a “-”, or a zero, it is determined not to be a coupon. However, “5.5” is not a number and a fraction, such as “5½”, and it is not preceded by a single quote, a “/”, a “-”, or a zero, so processing continues.

Because the current token has not been resolved as a coupon as of yet, the next token “04/04” is checked to determine if it is a maturity at 503. But first, an inquiry is made as to whether the next token “04/04” is in a new ticker domain. However, since a new ticker is not between the position of the next token “04/04” and the position of the current token “5.5”, as shown in Table IX, it is decided that the next token is not in a new ticker domain. Further, because the maturity field in context 1 is “Null”, as shown in Table X, it is decided that a maturity for this context does not exist, and processing continues.

The next attempt to determine whether the next token “04/04” is a maturity includes checking it for compliance with a date format. Because “04/04” fits into a MM/YY format, where “M” represents a month digit and “Y” represents a year digit, and because “04/04” does not have any characteristics that would prevent it from being a valid date, it is resolved as a maturity and the current token “5.5” is resolved as a coupon. Therefore, the current token “5.5” is stored as a coupon in the current context, context 1, and the next token “04/04” is stored in context 1 as a maturity at 504 in FIG. 5 and as shown in Table XI below.

TABLE XI
Context 1:
Ticker: BAT
Coupon: 5.5
Maturity: 04/04
Bid Price: Null
Ask Price: Null

At 507, an attempt to match one or more CUSIPs to the data in context 1 is made. That is, if any CUSIPs for ticker “BAT” with a coupon of “5.5” and a maturity of “04/04” exist, they will be identified and stored for later output. The CUSIP(s) that match the data in the current context may optionally be stored in the context itself. Whether or not one or more CUSIPs are identified, processing continues back to 401 in FIG. 4 to check for more unprocessed tokens.

The next unprocessed token is “65-80” as shown in Table IX, which is selected at 403. Since no new ticker is located between this token and the previous token, processing proceeds from 404 to 501 in FIG. 5, and the current token “65-80” is determined to be in the ticker domain of “BAT” and to apply to context 1.

At 501, an attempt is made to resolve the current token “65-80” as a coupon. However, since a coupon already exists in context 1, as shown in Table XI, it is determined that the current token is not a coupon and processing proceeds to 505 to determine if it is a maturity. Similarly, because the current context includes a maturity, as shown in Table XI, the current token “65-80” is determined not to be a maturity and processing proceeds to 508 to check if it can be a bid and/or an offer.

At 508, the current token “65-80” is compared to the following bid/offer formats: “N”, “N/”, “/N”, or “N/N”, where N represents a number. Whitespace can be before or after each N or “/”, and each “/” can be replaced with a “-”. Also, bids and offers may not begin with a leading zero. Because “65-80” has the format “N-N” and does not begin with a leading zero, it is resolved as a bid and an offer and stored as such in the current context, context 1, as shown in Table XII below.

TABLE XII
Context 1:
Ticker: BAT
Coupon: 5.5
Maturity: 04/04
Bid Price: 65
Ask Price: 80

After storage of the bid and offer prices in context 1, processing continues back to 401 in FIG. 4 to find more unprocessed tokens in this line. The next unprocessed token is “6.5” as shown in Table IX. At 403, this token is selected as the current token, and processing begins to determine the ticker domain to which this token belongs. To determine if the current token “6.5” is in a new ticker domain, a check is made for a ticker between the current token “6.5” and the previous token “65-80” at 403. As shown in Table IX, the ticker “HHH” is between these tokens, and an answer of “yes” is returned at 404. Context 1 now becomes the previous context at 405, and another attempt to identify one or more CUSIPs for context 1 is made at 406. After checking for CUSIPs at 406, a new context, context 2, is initialized as shown in Table XIII below.

TABLE XIII
Context 2:
Ticker: HHH
Coupon: Null
Maturity: Null
Bid Price: Null
Ask Price: Null
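The context record and the new-ticker-domain check used in this example can be sketched as follows. The `Context` class, the `find_new_ticker` function, and the set of known tickers are hypothetical names and data chosen for illustration; how tickers are actually recognized is not modeled here and is simply assumed to be a lookup against a known set.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Context:
    """Fields of a context as shown in the tables; None stands for Null."""
    ticker: str
    coupon: Optional[str] = None
    maturity: Optional[str] = None
    bid_price: Optional[str] = None
    ask_price: Optional[str] = None
    cusips: List[str] = field(default_factory=list)


def find_new_ticker(line: str, known_tickers: Set[str],
                    prev_pos: int, cur_pos: int) -> Optional[str]:
    """Return a known ticker located between the previous and current token."""
    for match in re.finditer(r"[A-Za-z]+", line[prev_pos:cur_pos]):
        word = match.group().upper()
        if word in known_tickers:
            return word
    return None


LINE = "BAT 5.5 04/04 65-80 HHH 6.5 06 80-90"
KNOWN_TICKERS = {"BAT", "HHH"}  # assumed to come from the bond list

# Token "6.5" is at position 24 and the previous token "65-80" at position 14.
new_ticker = find_new_ticker(LINE, KNOWN_TICKERS, 14, 24)
context_2 = Context(ticker=new_ticker) if new_ticker else None
```

Here the ticker “HHH” lies between positions 14 and 24, so a fresh context with all other fields null is created, corresponding to Table XIII.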

The current token “6.5” and the remaining tokens “06” and “80-90” are processed with respect to context 2 in the same manner as the first three tokens were processed with respect to context 1, and this processing will not be further described. Once processing of the email is complete, the CUSIPs identified for each context, if any, along with any resolved bid and/or offer prices pertaining to each context are output. According to experimental data, the invention extracts bond information from an assortment of emails having varying degrees of structure, 60% of the time, with 5-7% being false positives.

It is to be understood that the above-described embodiment and example are merely illustrative of the present invention and that many variations of the above-described embodiment and example can be devised by one skilled in the art without departing from the scope of the invention. For example, this system and method could easily be modified to scan partially unstructured documents for other information besides CUSIP numbers, and could be used, for instance, to scan email for SPAM, check files for viruses, or route messages without specific addresses. It is therefore intended that any such variations and their equivalents be included within the scope of the following claims.

1. A method executed by a computer for identifying at least one of a plurality of predefined data vectors from partially unstructured data, said partially unstructured data comprising a plurality of data items having positions relative to each other in said partially unstructured data, the method comprising: determining a position of each of one or more data items of a first type from the plurality of data items in the partially unstructured data, wherein a data item of the first type is representative of a ticker; selecting a data item of a second type from the plurality of data items in the partially unstructured data, wherein a data item of the second type is representative of a coupon or a maturity; selecting one of the one or more data items of the first type based on its position relative to the selected data item of the second type; and identifying a predefined data vector from the plurality of predefined data vectors using the selected data item of the first type and the selected data item of the second type, wherein the predefined data vector is representative of a CUSIP for a particular security.

2. The method of claim 1, further comprising: selecting a data item of a third type from the plurality of data items in the partially unstructured data, wherein a data item of the third type is representative of a coupon or a maturity, whichever the data item of the second type is not, and wherein identifying the predefined data vector using the selected data item of the first type and the selected data item of the second type identifies the predefined data vector using the selected data item of the first type, the selected data item of the second type, and the selected data item of the third type.

3. The method of claim 2, further comprising: selecting a data item of a fourth type from the plurality of data items in the partially unstructured data, wherein the data item of the fourth type is representative of trade information.

4. The method of claim 3, further comprising: outputting (a) the identified predefined data vector and (b) the data item of the fourth type.

5. The method of claim 4, further comprising: storing in a context: (a) the selected data item of the first type, (b) the selected data item of the second type, (c) the selected data item of the third type, (d) the selected data item of the fourth type, and (e) the identifier.

6-9. (canceled)